1. Introduction

MAVEQC is a flexible R package that provides QC analysis of Saturation Genome Editing (SGE) experimental data. It is available from https://github.com/wtsi-hgi/MAVEQC under the GPL 3.0 licence.


2. Screen QC

Displays QC plots and statistics for all samples.


2.1. Sample Sheet


2.2. Run Sample QC

2.2.1. Read Length Distribution

Displays, for each sample, the percentage of reads in each 50-nucleotide length bin, relative to the total number of raw reads.

Note: the expected read length is 300, which is the sequencing length.

Note: the read length shown here has the primer lengths deducted, which assumes that the reads in the input files include primers (see 2.1. Sample Sheet: quants_append_start and quants_append_end).

Pass criterion: more than 90% of reads are longer than 200 nucleotides
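
The binning and pass check described above can be sketched as follows. This is a minimal illustration in Python, not MAVEQC's own code: the function name read_length_qc, its parameters, and the way the primer length is deducted are assumptions made for the example.

```python
from collections import Counter

def read_length_qc(read_lengths, primer_len=0, bin_size=50,
                   min_length=200, pass_fraction=0.90):
    """Bin primer-trimmed read lengths and apply the length pass criterion.

    All names and defaults are illustrative assumptions, not MAVEQC's API.
    """
    if not read_lengths:
        raise ValueError("no reads supplied")
    # Deduct the combined primer length from each raw read length.
    trimmed = [max(length - primer_len, 0) for length in read_lengths]
    total = len(trimmed)
    # Assign each read to a bin labelled by its lower edge (0, 50, 100, ...).
    bins = Counter((length // bin_size) * bin_size for length in trimmed)
    percentages = {start: 100.0 * n / total for start, n in sorted(bins.items())}
    # Fraction of reads strictly longer than the minimum length.
    frac_long = sum(1 for length in trimmed if length > min_length) / total
    return percentages, frac_long, frac_long > pass_fraction

# Hypothetical sample: 95 full-length reads and 5 short reads, 40 nt of primers.
percentages, frac_long, passed = read_length_qc([300] * 95 + [150] * 5, primer_len=40)
```

Here a read counts as "longer than 200 nucleotides" only if its primer-trimmed length is strictly greater than 200; whether the comparison should be strict is an assumption.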


2.2.2. Missing Variants

Stats of missing variants in the library

Pass criterion: less than 1% of variants (library sequences) are missing


Records of missing variants in the library

Note: Unique indicates that a template sequence occurs only once in the VaLiAnT meta file. This is important because a template sequence can occur more than once, depending on the mutation types applied in VaLiAnT.
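
The pass criterion for this section can be illustrated with a short sketch. It assumes that a missing variant is a library sequence with a count of 0, and that counts are supplied as a mapping from sequence to read count; the function name and the input shape are hypothetical.

```python
def missing_variant_qc(library_counts, max_missing_pct=1.0):
    """Return the missing sequences, their percentage, and the pass flag.

    library_counts: dict mapping each library sequence to its read count
    (an assumed input shape, not MAVEQC's own data structure).
    """
    if not library_counts:
        raise ValueError("library is empty")
    # A variant is treated as missing when it has no reads at all.
    missing = sorted(seq for seq, count in library_counts.items() if count == 0)
    missing_pct = 100.0 * len(missing) / len(library_counts)
    # Pass when fewer than 1% of library sequences are missing.
    return missing, missing_pct, missing_pct < max_missing_pct

# Hypothetical library of 200 sequences, one of which has no reads.
counts = {f"seq{i}": 50 for i in range(199)}
counts["seq_missing"] = 0
missing, missing_pct, passed = missing_variant_qc(counts)
```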


2.2.3. Total Reads (Counts)

Displays the total number of reads per sample. Reads are split into accepted and excluded groups using one-dimensional k-means clustering, which separates out unique sequences with low read counts.

  • Accepted reads: Total read count for all unique sequences with sufficient reads.
  • Excluded reads: Total read count for all unique sequences with insufficient reads.

Total Reads: the total number of raw reads

Pass criterion: more than 1,000,000 total reads

Note: no reads are removed in this step; the clustering is used only to group sequences (variants) by read abundance.


2.2.4. Accepted Reads (Percentage)

Displays the percentage of library reads vs non-library reads (i.e. Reference, PAM and Unmapped) for Accepted Reads (see 2.2.3 for an explanation).

  • Library Reads: Percentage reads mapping to template oligo sequences, including intended variants.
  • Reference Reads: Percentage reads mapping to Reference.
  • PAM Reads: Percentage reads mapping to PAM/Protospacer Protection Edits (PPEs), without intended variant.
  • Unmapped Reads: Percentage of reads that do not map to library sequences, the PAM sequence, or the reference sequence.
  • Library Coverage: Mean read count per template oligo sequence.

Pass criterion: more than 40% of accepted reads are library reads

Library coverage is the mean read count per template oligo sequence (the total number of library reads divided by the total number of library sequences).

Pass criterion: library coverage is more than 100
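
The two pass criteria in this section reduce to simple arithmetic, sketched below. The sketch assumes that the accepted reads are exactly the sum of the four categories listed above; the function name and arguments are hypothetical.

```python
def accepted_read_qc(library_reads, reference_reads, pam_reads, unmapped_reads,
                     n_library_sequences, min_library_pct=40.0, min_coverage=100.0):
    """Compute category percentages and library coverage for one sample.

    Assumes accepted reads = library + reference + PAM + unmapped reads.
    """
    categories = {
        "library": library_reads,
        "reference": reference_reads,
        "pam": pam_reads,
        "unmapped": unmapped_reads,
    }
    accepted = sum(categories.values())
    if accepted == 0 or n_library_sequences == 0:
        raise ValueError("need at least one accepted read and one library sequence")
    percentages = {name: 100.0 * n / accepted for name, n in categories.items()}
    # Library coverage: mean read count per template oligo sequence.
    coverage = library_reads / n_library_sequences
    return {
        "percentages": percentages,
        "library_coverage": coverage,
        "pass_library_pct": percentages["library"] > min_library_pct,
        "pass_coverage": coverage > min_coverage,
    }

# Hypothetical sample with 1,000,000 accepted reads and 5,000 library sequences.
result = accepted_read_qc(600_000, 200_000, 100_000, 100_000, n_library_sequences=5_000)
```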


2.2.5. Genomic Coverage

Distribution of variants across the targeton region, based on log2(count+1) values.

Note: missing variants (0 count in the library) are not included.

Low Abundance cutoff: the threshold used to decide whether a variant is low abundance (the green dashed line in the figure)

% Low Abundance: the percentage of low-abundance variants

Pass criterion: the percentage of low-abundance variants is lower than 30%
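
A minimal sketch of the low-abundance check follows. How the cutoff itself is derived is not described here, so it is taken as an input; the sketch also assumes that the cutoff is expressed on the log2(count+1) scale and that a variant is low abundance when its value falls below it. The function name is hypothetical.

```python
import math

def low_abundance_qc(counts, cutoff, max_low_pct=30.0):
    """Return the percentage of low-abundance variants and the pass flag.

    counts: read counts per library variant.
    cutoff: threshold on the log2(count + 1) scale (assumed; supplied by the caller).
    """
    # Missing variants (zero count) are not included.
    present = [c for c in counts if c > 0]
    if not present:
        raise ValueError("no variants with non-zero counts")
    log_values = [math.log2(c + 1) for c in present]
    n_low = sum(1 for value in log_values if value < cutoff)
    low_pct = 100.0 * n_low / len(present)
    # Pass when fewer than 30% of variants are low abundance.
    return low_pct, low_pct < max_low_pct

# Hypothetical counts; the zero-count variant is dropped before the calculation.
low_pct, passed = low_abundance_qc([0, 1, 3, 7, 15, 1023], cutoff=2.5)
```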


2.2.6. Genomic Position Percentage

Displays the distribution of “LOF” (loss-of-function) vs all “Other” variants across the targeton region, based on read percentages for the reference timepoint. Requires concordant distribution of LOF and Other variants.

Note: missing variants (0 count in the library) are not included.


2.3. Run Experiment QC

2.3.1. Sample Correlations



2.3.2. Sample PCA


2.3.3. Fold Change (by category)

condition_Day7_vs_Day4

condition_Day15_vs_Day4


2.3.4. Fold Change (by position)

condition_Day7_vs_Day4

condition_Day15_vs_Day4



3. QC Results

This section summarises the final results. The cutoffs used for PASS/FAIL are listed below.


3.1. Sample QC Results


4. Methods and Glossary

4.1. Methods

4.1.1. Methods of generating accepted reads (refer to 2.2.3)

  1. Apply 1D k-means clustering to the sequences (variant sequences) using log2 read counts. Sequences with low read counts are removed in this step.
  2. A valid sequence must have a count of at least 5 in at least 25% of the samples in an experiment.
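
The two steps above can be sketched as two small helpers. This is an illustration under stated assumptions, not MAVEQC's implementation: it assumes two clusters (k = 2), uses log2(count + 1) so that zero counts are defined, runs a plain Lloyd-style k-means written out by hand, and leaves open how the two steps are combined. All function names are hypothetical.

```python
import math

def kmeans_1d_two_clusters(values, max_iter=100):
    """Lloyd's algorithm in one dimension with k = 2 (assumed).

    Returns one label per value: 0 for the low cluster, 1 for the high cluster.
    """
    lo, hi = min(values), max(values)  # initialise centres at the extremes
    labels = [0] * len(values)
    for _ in range(max_iter):
        # Assign each value to its nearest centre.
        labels = [0 if abs(v - lo) <= abs(v - hi) else 1 for v in values]
        low = [v for v, lab in zip(values, labels) if lab == 0]
        high = [v for v, lab in zip(values, labels) if lab == 1]
        new_lo = sum(low) / len(low)
        new_hi = sum(high) / len(high) if high else hi
        if (new_lo, new_hi) == (lo, hi):
            break  # centres stopped moving
        lo, hi = new_lo, new_hi
    return labels

def high_abundance_sequences(sample_counts):
    """Step 1: keep sequences whose log2(count + 1) falls in the high cluster."""
    seqs = list(sample_counts)
    logs = [math.log2(sample_counts[s] + 1) for s in seqs]
    labels = kmeans_1d_two_clusters(logs)
    return {s for s, lab in zip(seqs, labels) if lab == 1}

def passes_min_count(counts_across_samples, min_count=5, min_fraction=0.25):
    """Step 2: a count of at least 5 in at least 25% of the samples."""
    n_ok = sum(1 for c in counts_across_samples if c >= min_count)
    return n_ok / len(counts_across_samples) >= min_fraction

# Hypothetical counts for one sample.
kept = high_abundance_sequences({"a": 1, "b": 2, "c": 500, "d": 800, "e": 0})
```

With the hypothetical counts above, sequences "c" and "d" land in the high cluster and the other three fall into the low cluster.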

4.1.2. Methods of DESeq2 calculation (refer to 2.3.3 and 2.3.4)

  1. Use the total number of accepted reads to calculate the size factors applied in DESeq2 normalisation.
  2. Run DESeq2 for each consequence.
  3. Select synonymous and intronic variants as the controls, then calculate the median log2 fold change of the control variants.
  4. Re-calculate the log2 fold change of the other consequences by subtracting the median log2 fold change of the control variants.
  5. Re-calculate the p-value and the adjusted p-value.
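
Steps 3 and 4 amount to centring the log2 fold changes on the control median, which the sketch below illustrates. It is not MAVEQC's code: the record layout, the consequence labels "synonymous" and "intronic", and the function name are assumptions. For simplicity the shift is applied to every variant, controls included, and the p-value re-calculation in step 5 is not covered.

```python
from statistics import median

# Assumed labels for the control consequences.
CONTROL_CONSEQUENCES = {"synonymous", "intronic"}

def centre_on_controls(variants):
    """Subtract the median control log2 fold change from every variant.

    variants: list of dicts with 'consequence' and 'log2FoldChange' keys
    (an assumed layout). Returns new dicts with 'adj_log2FoldChange' and 'stat'.
    """
    control = [v["log2FoldChange"] for v in variants
               if v["consequence"] in CONTROL_CONSEQUENCES]
    if not control:
        raise ValueError("no control variants found")
    shift = median(control)
    adjusted = []
    for v in variants:
        adj = v["log2FoldChange"] - shift
        # Positive values are enriched, negative values are depleted.
        stat = "enriched" if adj > 0 else "depleted" if adj < 0 else None
        adjusted.append({**v, "adj_log2FoldChange": adj, "stat": stat})
    return adjusted

# Hypothetical DESeq2 output for four variants.
rows = centre_on_controls([
    {"consequence": "synonymous", "log2FoldChange": 0.5},
    {"consequence": "intronic", "log2FoldChange": 1.5},
    {"consequence": "missense", "log2FoldChange": -1.0},
    {"consequence": "stop_gained", "log2FoldChange": 3.0},
])
```

In this example the control median is 1.0, so the missense variant moves from -1.0 to -2.0 (depleted) and the stop_gained variant from 3.0 to 2.0 (enriched).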

4.2. Glossary

4.2.1. Glossary of DESeq2 calculation (refer to 2.3.3 and 2.3.4)

  • log2FoldChange: the initial log2FoldChange from DESeq2 using all the accepted reads
  • lfcSE: the initial lfcSE from DESeq2 using all the accepted reads
  • padj: the initial padj from DESeq2 using all the accepted reads
  • adj_log2FoldChange: the re-calculated log2FoldChange, obtained by subtracting the median log2 fold change of the control variants
  • adj_score: the re-calculated score derived from adj_log2FoldChange
  • adj_pval: the re-calculated p-value derived from adj_log2FoldChange
  • adj_fdr: the re-calculated FDR (adjusted p-value)
  • stat: adj_log2FoldChange > 0 is enriched, adj_log2FoldChange < 0 is depleted